This presentation focuses on predicting two variables in our Beer Sales data: Channel and Spend Per Trip.
For our Channel predictions, we’ve built Random Forest and Support Vector models, each with an accuracy of ~0.6. This seems good at first, but we’re having difficulty getting the models to predict cases where the channel equals “mass” (i.e., Walmart, Target, Costco); they default to “food” (i.e., grocery stores) instead. We believe continued development of these models will rectify this issue.
For our Spend Per Trip predictions, we’ve built Lasso and Neural Network models. They have R-squared values of 0.81 and 0.86, respectively, which shows the models are doing a good job of predicting our customers’ spend based on the attributes of their trip.
| model | accuracy | sensitivity | specificity | precision | rmse | mae | rsq |
|---|---|---|---|---|---|---|---|
| Regression Neural Network | NA | NA | NA | NA | 3.03 | 0.96 | 0.86 |
| Classification Random Forest Tuned | 0.61 | 0.07 | 0.98 | 0.65 | NA | NA | NA |
| Classification Support Vector | 0.60 | 0.02 | 0.99 | 0.59 | NA | NA | NA |
| Regression Lasso Tuned | NA | NA | NA | NA | 3.35 | 1.05 | 0.81 |
This project is focused on a beer sales dataset that captures detailed customer transactions, specifically focusing on beer purchases from October 2013 to October 2014. My primary goal is to develop models that effectively identify patterns within these transactions.
For the categorical variable I want to predict, I am going with “channel”, which indicates where the purchase was made. I am filtering this to two channels: “food” (i.e., grocery stores) and “mass” (i.e., Walmart, Target, Costco).
For the continuous variable, I’m trying to predict the “spend_per_trip”. This indicates how much the customer spent in total on their trip to the store.
The raw dataset has 184,973 rows and 12 columns; two of those columns are IDs, which will not be used in my models.
The categorical columns all have high cardinality, so I filtered to the two largest sales channels, food and mass. This resulted in a dataset with 149,718 rows and 12 columns.
The initial dataset contains multiple rows if a customer bought more than 1 type of item on their trip. Due to this, I created two datasets to capture information about trips and customers.
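Collapsing item-level rows into a trip-level dataset amounts to a group-and-aggregate over trips. A minimal pandas sketch on toy data; the column names here are illustrative stand-ins for the dataset’s actual fields:

```python
import pandas as pd

# Toy item-level rows: one row per (user, trip, item); names are
# hypothetical stand-ins for the real dataset's columns
raw = pd.DataFrame({
    "user_id":        [1, 1, 2],
    "trip_date":      ["2014-01-04", "2014-01-04", "2014-01-05"],
    "channel":        ["food", "food", "mass"],
    "units":          [2, 1, 3],
    "spend_per_unit": [7.49, 15.97, 4.77],
})
raw["item_spend"] = raw["units"] * raw["spend_per_unit"]

# Collapse item rows into one row per trip
trips = (
    raw.groupby(["user_id", "trip_date", "channel"], as_index=False)
       .agg(spend_per_trip=("item_spend", "sum"),
            basket_size_units=("units", "sum"))
)
```

A customer-level table follows the same pattern, grouping by `user_id` alone and aggregating across trips.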
DU MSBA Alumnus Andrew Brooks
TO PREDICT WITH
* Trip Date: The date of the User’s trip.
* Retailer: The name of the venue where the sale takes place (e.g., Walmart, Kroger, and Safeway).
* Banner: The sub-category of Retailer, if there’s a subtype available.
* Parent Brand: Name of the parent company that is selling the beer (e.g., Bud Light, Miller, and Coors).
* Brand: Name of the brand of beer that is being sold.
* Item Description: The type of item that is being sold, like a 12 or 24 pack.
* Units Per Trip: How many of the item were bought.
* Spend Per Unit: How much money was spent per unit that was bought.
* Basket Size (Units): How many items were in the User’s basket on that trip.
* Basket Size ($): The monetary value of the User’s basket on the trip.
WE WANT TO PREDICT
* Channel: The type of venue where the trip takes place (e.g., Bodega, Fast Food, Drug Store).
* Spend Per Trip: How much money was spent on the trip on Beer.
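The engineered date features that appear in the sample data (quarter, month, weekend flags, day of week, season) can all be derived from Trip Date. A hedged pandas sketch; the season cutoffs are an assumption (meteorological seasons) that matches the sample rows:

```python
import pandas as pd

df = pd.DataFrame({"trip_date": pd.to_datetime(["2014-01-04", "2014-07-10"])})

df["trip_date_quarter"] = df["trip_date"].dt.quarter
df["trip_date_month"]   = df["trip_date"].dt.month
df["day_of_week"]       = df["trip_date"].dt.day_name()
df["is_weekend"]        = (df["trip_date"].dt.dayofweek >= 5).astype(int)
df["is_weekday"]        = 1 - df["is_weekend"]

# Assumed meteorological seasons (Dec-Feb = Winter, Mar-May = Spring, ...),
# consistent with the sample rows (month 2 -> Winter, 4 -> Spring, 7 -> Summer)
season = {12: "Winter", 1: "Winter", 2: "Winter",
          3: "Spring", 4: "Spring", 5: "Spring",
          6: "Summer", 7: "Summer", 8: "Summer",
          9: "Fall", 10: "Fall", 11: "Fall"}
df["season"] = df["trip_date_month"].map(season)
```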
|  | channel | units_per_trip | spend_per_unit | spend_per_trip | basket_size_units | basket_size_dollars | trip_date_quarter | trip_date_month | is_weekend | is_weekday | season | day_of_week | spend_category |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 165998 | mass | 1 | 16.64 | 16.64 | 3 | 74.77 | 2 | 4 | 1 | 0 | Spring | Sunday | High |
| 153947 | mass | 1 | 11.99 | 11.99 | 21 | 125.89 | 3 | 7 | 0 | 1 | Summer | Thursday | Medium |
| 128031 | mass | 1 | 4.77 | 4.77 | 17 | 69.37 | 1 | 2 | 1 | 0 | Winter | Saturday | Low |
| 154524 | mass | 1 | 6.19 | 6.19 | 25 | 177.75 | 1 | 2 | 1 | 0 | Winter | Saturday | Low |
| 57586 | food | 1 | 7.49 | 7.49 | 21 | 68.68 | 2 | 5 | 1 | 0 | Spring | Saturday | Low |
| 82002 | mass | 1 | 15.97 | 15.97 | 12 | 74.83 | 4 | 11 | 1 | 0 | Fall | Saturday | Medium |
In this Random Forest model, we’re trying to determine which of the two largest channels a consumer purchased their beer at. An accuracy of 0.61 reflects some positive initial results, but the sensitivity of 0.067 shows the model is doing a poor job of predicting when a trip was at a “mass” channel. A specificity of 0.976 tells me it’s doing really well on the “food” channel, but it also tells me the model is just picking “food” for most rows. For next steps, we could look at increasing the weight of the “mass” class in model training.
| .metric | .estimate |
|---|---|
| accuracy | 0.613 |
| sensitivity | 0.067 |
| specificity | 0.976 |
| precision | 0.649 |
| Prediction \ Truth | mass | food |
|---|---|---|
| mass | 1615 | 874 |
| food | 22332 | 35067 |
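The reported metrics follow directly from this confusion matrix, treating “mass” as the positive class (the choice that reproduces the reported numbers). A quick check in Python:

```python
# Confusion matrix for the tuned Random Forest ("mass" = positive class)
tp, fn = 1615, 22332   # truth mass: predicted mass / predicted food
fp, tn = 874, 35067    # truth food: predicted mass / predicted food

accuracy    = (tp + tn) / (tp + fn + fp + tn)
sensitivity = tp / (tp + fn)   # recall on "mass"
specificity = tn / (tn + fp)   # recall on "food"
precision   = tp / (tp + fp)
```

Note how the huge `fn` count (mass trips predicted as food) drives sensitivity down while leaving specificity high.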
The performance of this Support Vector Classifier and the Random Forest model is remarkably similar, with a slight loss in precision and sensitivity for the SVC.
| .metric | .estimate |
|---|---|
| accuracy | 0.602 |
| sensitivity | 0.018 |
| specificity | 0.992 |
| precision | 0.590 |
| Prediction \ Truth | mass | food |
|---|---|---|
| mass | 435 | 302 |
| food | 23512 | 35639 |
In this model, we’re trying to predict a customer’s total spend on a trip using Lasso regression.
[1] "The lowest rmse Lasso penalty is 0.01"
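The penalty search behind that output can be sketched with scikit-learn’s `Lasso` as a rough analogue of the original R/tidymodels tuning step (`alpha` plays the role of tidymodels’ `penalty`); the data here is synthetic, not the beer dataset:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data with a few irrelevant features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, 0.0, 1.5, 0.0, 0.0]) + rng.normal(scale=0.5, size=200)

# Cross-validated search over a grid of penalties, scored by RMSE
grid = GridSearchCV(
    make_pipeline(StandardScaler(), Lasso(max_iter=10_000)),
    param_grid={"lasso__alpha": [0.001, 0.01, 0.1, 1.0]},
    scoring="neg_root_mean_squared_error",
    cv=5,
)
grid.fit(X, y)
best_penalty = grid.best_params_["lasso__alpha"]
```

Scaling the predictors before fitting matters here: the Lasso penalty is applied uniformly to all coefficients, so unscaled features would be shrunk unevenly.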
| model | accuracy | sensitivity | specificity | precision | rmse | mae | rsq |
|---|---|---|---|---|---|---|---|
| Classification Random Forest Tuned | 0.613 | 0.067 | 0.976 | 0.649 | NA | NA | NA |
| Regression Lasso Tuned | NA | NA | NA | NA | 3.353 | 1.047 | 0.811 |
| Classification Support Vector | 0.602 | 0.018 | 0.992 | 0.590 | NA | NA | NA |
| Regression Neural Network | NA | NA | NA | NA | 3.032 | 0.958 | 0.856 |
Classification Models: Both models suffer greatly on the sensitivity measure. This indicates they struggle to classify trips at “mass” channels, predicting “food” in most cases, hence the high specificity.
Regression Models: Both models perform similarly, but I would lean towards the Neural Network model for its slightly superior performance (lower RMSE and higher R-squared).
| model | accuracy | sensitivity | specificity | precision | rmse | mae | rsq |
|---|---|---|---|---|---|---|---|
| Regression Neural Network | NA | NA | NA | NA | 3.03 | 0.96 | 0.86 |
| Classification Random Forest Tuned | 0.61 | 0.07 | 0.98 | 0.65 | NA | NA | NA |
| Classification Support Vector | 0.60 | 0.02 | 0.99 | 0.59 | NA | NA | NA |
| Regression Lasso Tuned | NA | NA | NA | NA | 3.35 | 1.05 | 0.81 |
Our Spend Per Trip models are doing a good job of predicting our customers’ Spend Per Trip, both with R-squared values above 0.80. I would recommend continuing to explore these models to understand which predictors have the greatest impact on their predictions.
Our Channel models present a greater challenge. There is a slight class imbalance, and our models’ sensitivity (predicting “mass”) is very low, at 0.07 or below. This means the models are primarily just predicting that the Channel equals “food” in the vast majority of cases. For next steps, I would split the training dataset 50/50 between mass and food and see the results. Alternatively, you could look at methods for increasing the weight of the “mass” class and see if the models adjust.
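The 50/50-split idea amounts to downsampling the majority class. A minimal pandas sketch on toy data:

```python
import pandas as pd

# Toy imbalanced training data: "food" heavily outnumbers "mass"
df = pd.DataFrame({"channel": ["food"] * 6 + ["mass"] * 2,
                   "x": range(8)})

# Downsample every class to the size of the smallest one
n_min = df["channel"].value_counts().min()
balanced = pd.concat(
    [grp.sample(n=n_min, random_state=42) for _, grp in df.groupby("channel")]
)
```

The class-weight alternative avoids discarding rows: many learners accept weights directly (e.g., scikit-learn’s `class_weight="balanced"`), which upweights minority-class errors during training.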
I am most proud of the work I did to integrate my old Python code into this project. I see this as laying the groundwork for incorporating R into my future Python projects. Working through this class, I came to really like a lot of R’s features and syntax, and I could see myself using R for data prep and exploration in a future project.
If given another week, I would spend the time refining the four predictive models in this report. I certainly felt the crunch of the capstone running at the same time as my full-time job.